The database that I’ll be exploring consists of red variants of the Portuguese “Vinho Verde” wine. The variables being explored are shown below.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Let’s first take a look at a summary of the red wine database variables to get a better idea of the data.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The mean and median wine quality are close to the midpoint of the scale (quality is scored from 0 to 10 as the median of at least three evaluations made by wine experts).
## [1] "percentage of wines with quality 5 or 6"
## [1] 82.48906
Because about 82% of the wines are rated 5 or 6, the median and mean of wine quality are representative of the entire dataset.
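As a minimal sketch of how a percentage like this can be computed, the snippet below uses only the first ten quality scores shown in the structure output above. The full analysis would use the complete `red$quality` column instead, and `pct_mid` is a name I made up for illustration.

```r
# First ten quality scores, copied from the str() output above.
# The full analysis would use red$quality instead.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Share of wines rated 5 or 6, expressed as a percentage.
# %in% gives TRUE/FALSE per wine, and the mean of a logical vector is a proportion.
pct_mid <- mean(quality %in% c(5, 6)) * 100
print(pct_mid)
```

On these ten rows the result will not match the 82.5% reported for all 1599 wines, since it is only a small sample.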
The distribution of alcohol in wine is interesting because we can see spikes at values ending in .5 or .0, which suggests that wineries tend to round their ABV values. I expected ABV and density to have an obvious relationship, but density seems to be more affected by the amount of sugar in the wine than by the volume of alcohol. As expected, the density of the wine is very close to the density of water: the average density of the wines in the database is 0.997 g/cm^3 and the density of water is 1 g/cm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
It is surprising to see that wineries measure pH with high precision. Since the great majority of wines fall within the 3-3.5 pH range, it will be interesting to investigate whether wine quality is affected by small variations in acidity or whether quality remains constant over a certain pH range. pH is approximately normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
pH is a measure of acidity, so it makes sense that the distributions of volatile acidity, fixed acidity, and pH are broadly similar. The distribution of volatile acidity appears to be bimodal, while fixed acidity and pH are approximately normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distributions of fixed acidity and volatile acidity are similar to the pH distribution, as expected. In the citric acid plot we again see peaks at exact values (0, 0.25, 0.5) and an almost uniform distribution, which is surprising since I expected citric acid to have a strong correlation with pH. This relationship might be worth exploring.
The distributions of sugar and salt suggest that there is a dependency between the two. Since both plots have a long positive tail, let’s apply a log transformation to them.
After the transformation, the possible relation between the two concentrations becomes more obvious. We see a lone low value in each and a similar distribution for values at the higher end. The balance between the two seems like a possible variable with a strong correlation to wine quality and will be explored in more detail later. Meanwhile, let’s see how the distributions of salt and sugar vary for wines of each quality.
From this group of plots, it looks like chlorides (salt) and sugar in the wine don’t greatly impact its quality. The concentrations have approximately the same distribution for all wines. However, since the amount of data on wines with quality outside the 5-7 range is limited, the patterns of their distributions are not fully formed.
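A minimal sketch of such a transformation is shown here on the first ten rows from the structure output above. I am assuming a base-10 logarithm; the column names `log.sugar` and `log.chlorides` and the data frame `red_head` are hypothetical and only for illustration.

```r
# First ten rows of the two columns, copied from the str() output above.
red_head <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076,
                0.075, 0.069, 0.065, 0.073, 0.071)
)

# A log transformation compresses a long right tail.
# Base 10 is an assumption; the natural log would work the same way.
red_head$log.sugar <- log10(red_head$residual.sugar)
red_head$log.chlorides <- log10(red_head$chlorides)

print(summary(red_head[, c("log.sugar", "log.chlorides")]))
```

When plotting with ggplot2, a similar effect can be had by keeping the raw columns and adding `scale_x_log10()` to the histogram, which leaves the axis labels in the original units.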
## [1] "Free Sulfur Dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## [1] "Total Sulfur Dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## [1] "Sulphates"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Nothing surprising here: the concentrations of SO2 in the wine are low, and the distributions of SO2 and sulphates are similar. A new variable, called SO2.ratio, is added to the “red” table to represent the amount of free SO2 relative to the total.
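The new variable can be sketched as a simple column-wise division. The example below uses the first ten rows from the structure output above in a small data frame I call `red_head` (a hypothetical name); on the full table the same expression would be applied to `red`.

```r
# First ten rows of the two SO2 columns, copied from the str() output above.
red_head <- data.frame(
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Free SO2 as a fraction of total SO2.
# Free SO2 is part of the total, so the ratio should fall between 0 and 1.
red_head$SO2.ratio <- red_head$free.sulfur.dioxide / red_head$total.sulfur.dioxide

print(round(red_head$SO2.ratio, 3))
```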
What is the structure of your dataset?
There are 1599 observations in the dataset with 12 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free SO2, total SO2, density, pH, sulphates, alcohol, quality). Quality is the only ordinal variable (it is stored as an integer), with the following possible levels.
quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
However, there are no quality observations outside of the 3-8 range, and more than 80 percent are at levels 5 or 6.
Other observations
* The mean wine quality is 5.64 and the median is 6
* Most wines have a pH between 3 and 3.5
* There seems to be a dependency between concentrations of sugar and salt
* The average alcohol content is 10.42%
What is/are the main feature(s) of interest in your dataset?
The main feature in the dataset is quality. Although the quality observations are limited, I hope to determine which variables have the greatest impact on perceived quality.
What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?
Alcohol, pH, and citric acid are the most likely to influence the quality of the wine.
Did you create any new variables from existing variables in the dataset?
I created SO2.ratio, the ratio of free to total SO2 concentration.
Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
The distributions for the concentrations of residual sugar and chlorides are long-tailed, so I performed a log transformation on them. The new distributions are close to normal. There is a clear relationship between salt and sugar concentrations.
To get a better idea of the relationships between the variables, I’ll take a quick look at a correlation matrix. I’ll compare alcohol, density, pH, SO2 ratio, and quality to the other variables, as those are the ones that piqued my interest during the univariate analysis.
The correlation results are disappointing: there isn’t any strong correlation between the variables. However, from the matrix it looks like volatile.acidity, citric.acid, SO2.ratio, alcohol, sulphates, and density are the most strongly related to wine quality, which is the variable we are most interested in. Therefore, these variables will be analyzed in more detail.
I decided to explore all variables whose correlation with quality is greater than 0.2 in absolute value.
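One way to apply such a cutoff is sketched below. I am assuming Pearson correlation and a threshold on the absolute value (volatile acidity is on the list even though its relationship with quality is negative). The helper `strong_correlates` is a hypothetical name, and the demo uses only the first ten rows from the structure output above, so its correlations are not the ones from the full matrix.

```r
# Hypothetical helper: columns whose correlation with `target`
# exceeds `threshold` in absolute value.
strong_correlates <- function(df, target = "quality", threshold = 0.2) {
  r <- cor(df)[, target]       # Pearson correlation of every column with the target
  r <- r[names(r) != target]   # drop the target's correlation with itself
  r[abs(r) > threshold]        # keep strong positive and negative correlations
}

# First ten rows of a few columns, copied from the str() output above.
red_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

result <- strong_correlates(red_head)
print(round(result, 2))
```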
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
As volatile acidity decreases, the quality of the wine increases: the median falls from 0.845 g/L for quality 3 to 0.37 g/L for quality 7 and 8. The relationship appears to level off once volatile acidity is around 0.4 g/L.
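Grouped summaries in the format shown above can be produced with base R's `by()`. The sketch below runs it on the first ten rows from the structure output, which only contain quality levels 5, 6, and 7, and also shows `tapply()` as a compact way to get just the medians. `red_head` and `median_by_quality` are hypothetical names.

```r
# First ten rows of the two columns, copied from the str() output above.
red_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Six-number summary of volatile acidity for each quality level.
print(by(red_head$volatile.acidity, red_head$quality, summary))

# Just the median per quality level, as a named vector.
median_by_quality <- tapply(red_head$volatile.acidity, red_head$quality, median)
print(median_by_quality)
```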
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
This is a fun one. As alcohol content goes up, so does the perceived quality. I wonder if that could be biased in some obscure way. As expected, wine density is dependent on the alcohol content.
This is a well-known physical property and is of no interest in this study because density doesn’t have a noticeable effect on wine quality, as seen below.
Therefore, density is not a factor in the alcohol/quality relationship.
There is not much going on here. Quality appears to increase as the concentration of sulphates in the wine increases, but the variation is too small to make anything of it. Sulphates have a negative correlation with volatile acidity, so it makes sense that more sulphates could go with better wine quality, since they tend to come with lower volatile acidity.
## red$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## red$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## red$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## red$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## red$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## red$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
Citric acid is known to add flavor and freshness to wines, and this plot suggests that is a good thing: an increase in citric acid in the wine tends to go with an increase in its quality. However, too much citric acid results in below average wines. The two wines with the highest citric acid concentration are of below average quality (4 and 5). Unintuitively, volatile acidity decreases as citric acid and fixed acidity increase. Therefore, quality should also increase with fixed acidity.
Variations in fixed acidity do not have the expected impact on wine quality. The relationship between citric acid, fixed acidity, and quality is something I want to investigate in more detail. Meanwhile, let’s get a better understanding of the differences between volatile and fixed acidity. Fixed acids are found naturally in grapes or are created through the normal fermentation process. Volatile acids are produced through fermentations carried out by spoilage organisms. The most common volatile acid is acetic acid, which is produced by bacteria as they ferment wine into vinegar! This explains their different relationships with wine quality.
Let’s see if there is an observable relationship between the concentration of sugar and salt.
The amount of salt in wine looks to be mostly constant except for a few wines with higher concentrations. I wonder if this has anything to do with wine quality.
This plot is misleading and not useful. It suggests that average wines tend to have larger salt or sugar concentrations. However, it only appears to be this way because we have many more datapoints at these quality values, which increases the possibility of outliers in our plot.
Since most of the wines fall within the 5-6 quality range, it makes sense to create three categories of wines so as to maximize the data points in each and have a better chance of detecting patterns. The categories I’ll use are below average (quality < 5), average (5 <= quality < 7), and above average (quality >= 7).
The data points for average wines clutter the plots, so they will be removed.
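A grouping like this can be sketched with `cut()`. I am assuming the average category covers quality 5 and 6; the variable name `rating` and the label strings are hypothetical, and the input is simply one example score per observed quality level.

```r
# One example score per quality level observed in the dataset (3 to 8).
quality <- c(3, 4, 5, 6, 7, 8)

# cut() uses right-closed intervals by default:
# (-Inf, 4] -> below average, (4, 6] -> average, (6, Inf] -> above average.
rating <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("below average", "average", "above average")
)

print(data.frame(quality, rating))
```

Dropping the average wines afterwards is then a matter of subsetting on `rating != "average"`.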
Much better. This plot shows a clear quality distinction between wines with higher fixed acidity and citric acid concentration. The opposite is true for volatile acidity.
It’s interesting that most of the wines with zero citric acid concentration are below average in quality. I at first thought that those values were probably empty cells or missing values but this strongly suggests otherwise. Citric acid and alcohol are the most useful variables if we were to try and predict the quality of the wine.
It’s clear that above average wines tend to have higher concentrations of citric acid and alcohol. Using this insight, let’s revisit the volatile acidity vs citric acid scatter plot, but plot only wines with an alcohol content larger than 11%.
Below average wines are mostly eliminated. Let’s see what happens if we raise the alcohol cutoff to 12%.
There is only one below average wine left! Further, we can see that most of the wines left have a citric acid concentration above 0.2. Clearly alcohol content can be used to filter out low quality wines.
I originally planned to build a model to determine wine quality based on its properties. However, I came to the conclusion that there was not enough data to make accurate and reliable predictions. What is possible is to set a few conditions that, if followed, will decrease the chances of selecting a below average wine. In this case my recommendation would be to buy wines with an alcohol content of 12% or above, a concentration of citric acid higher than 0.2 g/L, and a volatile acidity lower than 0.4 g/L. This may not be able to predict the quality of the wine, but it comes close to assuring it won’t be below average.
This plot is essential to understanding the limitations of our dataset. There is very little data at the lowest and highest values of wine quality. Therefore, it is not feasible to develop a reliable prediction model, and we must be careful when exploring the data not to mistake a lack of data for a pattern. The distribution of wine quality is approximately normal.
Alcohol content and citric acid are the two variables that have the most impact on wine quality. As either of the two variables increases, so does the average wine quality. The mean differences between the lowest quality and highest quality categories for alcohol and citric acid are 2.1% and 0.22 g/L respectively.
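These differences appear to line up with the grouped summaries earlier in the report, if the lowest and highest categories are taken to be quality 3 and quality 8. That reading is my assumption. The arithmetic can be checked with the means copied from those summaries; the vector names here are hypothetical.

```r
# Group means copied from the by-quality summaries earlier in this report.
alcohol_mean <- c(q3 = 9.955, q8 = 12.09)
citric_mean <- c(q3 = 0.1710, q8 = 0.3911)

# Difference between the highest (8) and lowest (3) observed quality levels.
alcohol_diff <- unname(alcohol_mean["q8"] - alcohol_mean["q3"])
citric_diff <- unname(citric_mean["q8"] - citric_mean["q3"])

print(c(alcohol = alcohol_diff, citric.acid = citric_diff))
```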
## [1] "% Below Average"
## [1] 7.843137
## [1] "% Above Average"
## [1] 92.15686
This plot combines the three main variables. It clearly shows that wines with low volatile acidity and relatively high citric acid and alcohol concentrations tend to be of higher quality. In fact, only about 8 percent of the wines that meet these parameters were of below average quality.
The red wine quality database contains information about the chemical properties of 1599 different red wines. I started my exploration of the data by getting a quick understanding of each individual property and then used that understanding to investigate their relationship with the quality of the wine and with the other variables. I found the data to be lacking and decided against creating a predictive model for wine quality. Instead, I searched for the variables with the strongest correlation with wine quality to create parameters that could be used to minimize the probability of selecting a wine of below average quality.
There was a clear tendency for wine to increase in quality as alcohol and citric acid content increased and volatile acidity decreased. I was surprised to see that residual sugar didn’t have much of an effect on the quality of the wine.
The main issue with this database was its size. With a much larger wine dataset we could better establish patterns and discover new ones. Also, all of the data comes from “Vinho Verde” wines, which come from the north of Portugal and are generally young wines. It would be much more interesting to analyse data from wines all over the world and explore variables such as age, kind of grape, year of harvest, and region.
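A percentage like this can be sketched as a filter followed by a proportion. The thresholds below are the ones from my recommendation (alcohol of 12% or above, citric acid above 0.2 g/L, volatile acidity below 0.4 g/L). The five rows in `toy` are made-up values for illustration only, not rows from the dataset, and every name in the snippet is hypothetical.

```r
# Made-up rows for illustration only; NOT taken from the wine dataset.
toy <- data.frame(
  alcohol = c(12.5, 12.0, 13.1, 11.0, 12.8),
  citric.acid = c(0.45, 0.30, 0.10, 0.50, 0.35),
  volatile.acidity = c(0.30, 0.35, 0.30, 0.30, 0.38),
  quality = c(7, 8, 4, 7, 4)
)

# Keep wines that meet all three conditions.
selected <- subset(
  toy,
  alcohol >= 12 & citric.acid > 0.2 & volatile.acidity < 0.4
)

# Leave out average wines (quality 5 or 6), then compare the two remaining groups.
selected <- selected[!(selected$quality %in% c(5, 6)), ]
pct_below <- mean(selected$quality < 5) * 100
pct_above <- mean(selected$quality >= 7) * 100

print(c(below = pct_below, above = pct_above))
```

Because the average wines are left out before the proportions are taken, the two percentages always sum to 100, as they do in the output above.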